Skills: Describing variables

Week 2

This page illustrates some skills that will be useful to you as you learn to describe variables in a dataset.

For the R examples, I’ll be using the tidyverse family of packages (ggplot2 for the plots, plus some dplyr and tidyr functions later on) and a dataset called my_data.

library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.3.3

Descriptive statistics

You may want to describe your data with descriptive statistics.

Mean

The mean is the average value of a variable. It is a measure of central tendency that is sensitive to outliers.

R

The function in R that will calculate a mean is called mean().

mean(my_data$price)
## [1] 1043.235

Excel

In Excel, the function is AVERAGE().

Standard deviation

The standard deviation is a measure of how spread out the data are. It is sensitive to outliers.

R

In R, the function sd() will calculate a standard deviation.

sd(my_data$price)
## [1] 260.6514

Excel

In Excel, the function to calculate a standard deviation is STDEV.S(). This is the sample standard deviation, which matches R’s sd(); STDEV.P() computes the population version.

Median and other percentiles

A percentile is the value that a specified percentage of observations of a variable fall below. The 50th percentile is also called the median: it’s the value that half of all observations are less than.

The median is a measure of central tendency that is not sensitive to outliers.
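To see the difference in sensitivity, here is a sketch with a made-up vector (not part of my_data) containing one extreme outlier:

```r
# A toy vector with one extreme outlier
x <- c(10, 11, 12, 13, 1000)

mean(x)     # dragged up to 209.2 by the single outlier
median(x)   # stays at 12
```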

The range between lower and upper percentiles can be a useful measure of spread. For example, the interquartile range (IQR) is the range between the 25th and 75th percentiles. Half of all values fall within this range.

R

In R, you can calculate a percentile with the function quantile(). The median is the 50th percentile.

quantile(my_data$price, probs = 0.5)
##      50% 
## 1040.074

You can also calculate the 25th and 75th percentiles.

quartiles <- quantile(my_data$price, probs = c(0.25, 0.75))

quartiles
##       25%       75% 
##  866.5128 1222.3472

And the difference between them is the interquartile range.

iqr <- quartiles["75%"] - quartiles["25%"]

as.numeric(iqr)
## [1] 355.8344
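Base R also has a built-in IQR() function that computes this in one step; a quick check on a toy vector (not my_data):

```r
# IQR() subtracts the 25th percentile from the 75th in one call
x <- c(10, 11, 12, 13, 1000)
IQR(x)   # 75th percentile (13) minus 25th (11): 2
```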

Excel

In Excel, the function to calculate percentiles is PERCENTILE.EXC().

Proportions

Averages and standard deviations don’t make sense for categorical variables. The appropriate descriptive statistics would be proportions.

R

In R, you can use the table() function to get the number of observations in each category.

table(my_data$school)
## 
##    East Edgerly  Healey Kennedy    West 
##    5725     327     218    2259    1471

And then you can express these as proportions by dividing those numbers by the number of observations.

table(my_data$school) / nrow(my_data)
## 
##    East Edgerly  Healey Kennedy    West 
##  0.5725  0.0327  0.0218  0.2259  0.1471
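Base R’s prop.table() does the same division in one step; a quick sketch with a toy vector (not my_data):

```r
# prop.table() converts a table of counts into proportions
x <- c("East", "East", "West")
prop.table(table(x))   # East 2/3, West 1/3
```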

Excel

If you’ve set up a column for each category, where a value of 1 means the observation is in that category and a value of 0 means it is not, then the average of one of those columns is the proportion of observations within that category.

You can use an IF() function in Excel to convert categories to zeros and ones.
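For example, assuming the school names are in column A starting at A2 (an illustrative layout, not from the original file), the formulas might look like this:

```
=IF(A2="East", 1, 0)     in B2, copied down: 1 if the school is East, 0 otherwise
=AVERAGE(B2:B10001)      the proportion of observations in the East category
```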

Confidence intervals

In general, the data sets you work with represent samples of larger populations. Your goal is to be able to generalize from your sample to the population.

Sometimes this is literally and obviously true. For example, you might have sent a survey to one percent of all Canadian households, and you want to analyze the data to make some claims about the characteristics of Canadian households in general.

Other times, describing your sample in terms of the population it comes from is not as obvious. For example, you might have data about every county in the United States suggesting that incomes are higher in urban counties than in rural counties. You might argue that this is not a sample, because you have data on every county that exists. However, if you want to claim that this difference is based on a fundamental difference that is likely to always exist between urban and rural counties in the United States, then you need to be able to generalize this difference to a broader population, which would be all hypothetical counties that could ever exist in the United States.

Confidence intervals for averages

A confidence interval for an average is a range of values that you can have a specified level of confidence that the average value for the full population falls within. A one-sample t-test is a method for calculating a confidence interval.

R

Here is how you would do a one-sample t-test in R.

t.test(my_data$price)
## 
##  One Sample t-test
## 
## data:  my_data$price
## t = 400.24, df = 9999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1038.126 1048.344
## sample estimates:
## mean of x 
##  1043.235

By default, the t.test function returns a 95-percent confidence interval, but you can also specify other confidence levels. Here is how you would get the 99-percent confidence level.

t.test(my_data$price,
       conf.level = 0.99)
## 
##  One Sample t-test
## 
## data:  my_data$price
## t = 400.24, df = 9999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
##  1036.52 1049.95
## sample estimates:
## mean of x 
##  1043.235
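The interval t.test() reports is just the sample mean plus or minus the t critical value (for the chosen confidence level and degrees of freedom) times the standard error. A sketch with a made-up vector (not my_data):

```r
x <- c(5, 7, 9, 6, 8)

se <- sd(x) / sqrt(length(x))                  # standard error of the mean
margin <- qt(0.975, df = length(x) - 1) * se   # 95% t critical value times SE

mean(x) + c(-1, 1) * margin   # same interval t.test(x)$conf.int reports
```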

Excel

To calculate the 95-percent confidence interval in Excel, you’d need to know that the margin of error is the standard error times the z-score corresponding to the confidence level you want.

So, first you calculate the standard error: the standard deviation of the sample divided by the square root of the number of observations.

Then you calculate the z-score for the desired confidence level using the function NORM.S.INV(). The probability you should use for a 95% confidence level is 0.975. Or, if you use a probability of 0.025, you’ll get the opposite (i.e. negative) of the z-score you need.

Now you can calculate the upper and lower bounds of the confidence interval.
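Putting those steps together, here is a hypothetical layout (prices in A2:A10001; the cell references are illustrative, not from the original file):

```
=STDEV.S(A2:A10001)/SQRT(COUNT(A2:A10001))   standard error (say, in cell D1)
=NORM.S.INV(0.975)                            z-score for 95% confidence (D2)
=AVERAGE(A2:A10001)-D1*D2                     lower bound
=AVERAGE(A2:A10001)+D1*D2                     upper bound
```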

Confidence intervals for proportions

Confidence intervals for proportions work the same way as confidence intervals for averages, but you use the standard error of the proportion rather than the standard error of the mean.

R

In R, you use prop.test rather than t.test when you want to find the confidence interval for a proportion.

prop.test(x = sum(my_data$school == "East"),
          n = nrow(my_data))
## 
##  1-sample proportions test with continuity correction
## 
## data:  sum(my_data$school == "East") out of nrow(my_data), null probability 0.5
## X-squared = 209.96, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.5627276 0.5822164
## sample estimates:
##      p 
## 0.5725

Excel

To calculate the 95-percent confidence interval in Excel, you’d need to know that the confidence margin is the standard error times the z-score corresponding to the confidence level you want.

The standard error is given by the equation:

\[ S.E. = \sqrt{\frac{p\times(1-p)}{n}} \]

Where p is the proportion and n is the number of observations.

Then you calculate the z-score and the upper and lower bounds of the confidence interval as you would for a continuous variable.
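For example, if the proportion is in cell B1 and the number of observations is in B2 (hypothetical cells, for illustration):

```
=SQRT(B1*(1-B1)/B2)    standard error (say, in cell B3)
=NORM.S.INV(0.975)      z-score for 95% confidence (B4)
=B1-B3*B4               lower bound
=B1+B3*B4               upper bound
```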

Visualizing data

You might find it useful to visualize a variable in your dataset using a histogram, a one-dimensional scatterplot, or a bar chart.

Histogram

A histogram is a useful way to visualize the distribution of a continuous variable.

R

In R (ggplot2), geom_histogram() will generate a histogram.

Here, I’ll visualize the distribution of the price variable in my dataset.

ggplot(my_data) +
  geom_histogram(aes(x = price),
                 bins = 40) +
  theme_linedraw()

Excel

You can create a histogram in Excel using the Analysis ToolPak add-in. By default, it will put gaps between the bars. It is more common to show histograms without these gaps, so you can remove them.

One-dimensional scatterplot

A one-dimensional scatterplot can be a much more literal representation of the distribution of a dataset.

R

In R, you can use geom_jitter() to generate a scatterplot. You can set the x-coordinate to the variable you want to visualize and the y-coordinate to a constant, and R will randomly vary the y positions.

You might want to turn off the y-axis labels since they’ll be meaningless.

You might also want to make the points a little bit transparent to avoid overplotting.

ggplot(my_data) +
  geom_jitter(aes(x = price, y = 0),
              alpha = 0.1) +
  theme_linedraw() +
  theme(axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.y =  element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank())

Excel

In Excel, to achieve that same effect, you need to create a fake y-variable that varies randomly.

Then you can create a scatterplot.
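A hypothetical setup, with the prices in column A:

```
=RAND()    in B2, copied down: a random y-value between 0 and 1
```

Then select both columns and insert a scatter chart; the random column spreads the points vertically, much like geom_jitter() does in R.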

Bar charts

You might use a bar chart to represent proportions.

R

In R, you can use geom_bar() to represent proportions with a stacked bar chart.

ggplot(my_data) +
  geom_bar(aes(fill = school, x = 1),
           position = "fill") +
  scale_y_continuous(name = "Proportion",
                     breaks = c(0.2,
                                0.4,
                                0.6,
                                0.8,
                                1),
                     labels = c("20%",
                                "40%",
                                "60%",
                                "80%",
                                "100%")) +
  theme_linedraw() +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x =  element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

Or you can show each category as its own bar, with the count axis relabeled as percentages.

breaks <- c(2000, 4000, 6000)

ggplot(my_data) +
  geom_bar(aes(x = school)) +
  scale_y_continuous(breaks = breaks,
                     labels = paste0(100 * breaks / nrow(my_data), "%"),
                     name = "Percentage") +
  theme_linedraw()

You can also calculate the proportions and their confidence intervals and make a little table for them like this.

prop_summary <- my_data |>
  mutate(school_east = school == "East",
         school_edgerly = school == "Edgerly",
         school_healey = school == "Healey",
         school_kennedy = school == "Kennedy",
         school_west = school == "West") |>
  select(school_east, 
         school_edgerly, 
         school_healey,
         school_kennedy,
         school_west) |>
  pivot_longer(cols = everything(),
               names_to = "which_school",
               values_to = "is_school") |>
  group_by(which_school) |>
  summarise(Proportion = mean(is_school),
            `C.I. Low` = prop.test(x = sum(is_school==1),
                                   n = n(), 
                                   conf.level = 0.95)$conf.int[1],
            `C.I. High` = prop.test(x = sum(is_school==1),
                                   n = n(), 
                                   conf.level = 0.95)$conf.int[2])

prop_summary
## # A tibble: 5 × 4
##   which_school   Proportion `C.I. Low` `C.I. High`
##   <chr>               <dbl>      <dbl>       <dbl>
## 1 school_east        0.572      0.563       0.582 
## 2 school_edgerly     0.0327     0.0293      0.0364
## 3 school_healey      0.0218     0.0191      0.0249
## 4 school_kennedy     0.226      0.218       0.234 
## 5 school_west        0.147      0.140       0.154

And then you can use that to make a bar chart with error bars.

ggplot(prop_summary,
       aes(x = which_school,
           y = Proportion)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = `C.I. Low`,
                    ymax = `C.I. High`)) +
  theme_linedraw()